{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 用每日新闻预测金融市场变化（标准版）\n",
    "\n",
    "TF-IDF + SVM 是文本分类问题的基准线\n",
    "\n",
    "这篇教程我直接用最简单直接的方式处理。\n",
    "\n",
    "高级版本的教程会在日后的课程中放出。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.metrics import roc_auc_score\n",
    "from datetime import date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 监视数据\n",
    "\n",
    "我们先读入数据。这里我提供了一个已经combine好了的数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data = pd.read_csv('../input/Combined_News_DJIA.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这时候，我们可以看一下数据长什么样子"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Label</th>\n",
       "      <th>Top1</th>\n",
       "      <th>Top2</th>\n",
       "      <th>Top3</th>\n",
       "      <th>Top4</th>\n",
       "      <th>Top5</th>\n",
       "      <th>Top6</th>\n",
       "      <th>Top7</th>\n",
       "      <th>Top8</th>\n",
       "      <th>...</th>\n",
       "      <th>Top16</th>\n",
       "      <th>Top17</th>\n",
       "      <th>Top18</th>\n",
       "      <th>Top19</th>\n",
       "      <th>Top20</th>\n",
       "      <th>Top21</th>\n",
       "      <th>Top22</th>\n",
       "      <th>Top23</th>\n",
       "      <th>Top24</th>\n",
       "      <th>Top25</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2008-08-08</td>\n",
       "      <td>0</td>\n",
       "      <td>b\"Georgia 'downs two Russian warplanes' as cou...</td>\n",
       "      <td>b'BREAKING: Musharraf to be impeached.'</td>\n",
       "      <td>b'Russia Today: Columns of troops roll into So...</td>\n",
       "      <td>b'Russian tanks are moving towards the capital...</td>\n",
       "      <td>b\"Afghan children raped with 'impunity,' U.N. ...</td>\n",
       "      <td>b'150 Russian tanks have entered South Ossetia...</td>\n",
       "      <td>b\"Breaking: Georgia invades South Ossetia, Rus...</td>\n",
       "      <td>b\"The 'enemy combatent' trials are nothing but...</td>\n",
       "      <td>...</td>\n",
       "      <td>b'Georgia Invades South Ossetia - if Russia ge...</td>\n",
       "      <td>b'Al-Qaeda Faces Islamist Backlash'</td>\n",
       "      <td>b'Condoleezza Rice: \"The US would not act to p...</td>\n",
       "      <td>b'This is a busy day:  The European Union has ...</td>\n",
       "      <td>b\"Georgia will withdraw 1,000 soldiers from Ir...</td>\n",
       "      <td>b'Why the Pentagon Thinks Attacking Iran is a ...</td>\n",
       "      <td>b'Caucasus in crisis: Georgia invades South Os...</td>\n",
       "      <td>b'Indian shoe manufactory  - And again in a se...</td>\n",
       "      <td>b'Visitors Suffering from Mental Illnesses Ban...</td>\n",
       "      <td>b\"No Help for Mexico's Kidnapping Surge\"</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2008-08-11</td>\n",
       "      <td>1</td>\n",
       "      <td>b'Why wont America and Nato help us? If they w...</td>\n",
       "      <td>b'Bush puts foot down on Georgian conflict'</td>\n",
       "      <td>b\"Jewish Georgian minister: Thanks to Israeli ...</td>\n",
       "      <td>b'Georgian army flees in disarray as Russians ...</td>\n",
       "      <td>b\"Olympic opening ceremony fireworks 'faked'\"</td>\n",
       "      <td>b'What were the Mossad with fraudulent New Zea...</td>\n",
       "      <td>b'Russia angered by Israeli military sale to G...</td>\n",
       "      <td>b'An American citizen living in S.Ossetia blam...</td>\n",
       "      <td>...</td>\n",
       "      <td>b'Israel and the US behind the Georgian aggres...</td>\n",
       "      <td>b'\"Do not believe TV, neither Russian nor Geor...</td>\n",
       "      <td>b'Riots are still going on in Montreal (Canada...</td>\n",
       "      <td>b'China to overtake US as largest manufacturer'</td>\n",
       "      <td>b'War in South Ossetia [PICS]'</td>\n",
       "      <td>b'Israeli Physicians Group Condemns State Tort...</td>\n",
       "      <td>b' Russia has just beaten the United States ov...</td>\n",
       "      <td>b'Perhaps *the* question about the Georgia - R...</td>\n",
       "      <td>b'Russia is so much better at war'</td>\n",
       "      <td>b\"So this is what it's come to: trading sex fo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2008-08-12</td>\n",
       "      <td>0</td>\n",
       "      <td>b'Remember that adorable 9-year-old who sang a...</td>\n",
       "      <td>b\"Russia 'ends Georgia operation'\"</td>\n",
       "      <td>b'\"If we had no sexual harassment we would hav...</td>\n",
       "      <td>b\"Al-Qa'eda is losing support in Iraq because ...</td>\n",
       "      <td>b'Ceasefire in Georgia: Putin Outmaneuvers the...</td>\n",
       "      <td>b'Why Microsoft and Intel tried to kill the XO...</td>\n",
       "      <td>b'Stratfor: The Russo-Georgian War and the Bal...</td>\n",
       "      <td>b\"I'm Trying to Get a Sense of This Whole Geor...</td>\n",
       "      <td>...</td>\n",
       "      <td>b'U.S. troops still in Georgia (did you know t...</td>\n",
       "      <td>b'Why Russias response to Georgia was right'</td>\n",
       "      <td>b'Gorbachev accuses U.S. of making a \"serious ...</td>\n",
       "      <td>b'Russia, Georgia, and NATO: Cold War Two'</td>\n",
       "      <td>b'Remember that adorable 62-year-old who led y...</td>\n",
       "      <td>b'War in Georgia: The Israeli connection'</td>\n",
       "      <td>b'All signs point to the US encouraging Georgi...</td>\n",
       "      <td>b'Christopher King argues that the US and NATO...</td>\n",
       "      <td>b'America: The New Mexico?'</td>\n",
       "      <td>b\"BBC NEWS | Asia-Pacific | Extinction 'by man...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2008-08-13</td>\n",
       "      <td>0</td>\n",
       "      <td>b' U.S. refuses Israel weapons to attack Iran:...</td>\n",
       "      <td>b\"When the president ordered to attack Tskhinv...</td>\n",
       "      <td>b' Israel clears troops who killed Reuters cam...</td>\n",
       "      <td>b'Britain\\'s policy of being tough on drugs is...</td>\n",
       "      <td>b'Body of 14 year old found in trunk; Latest (...</td>\n",
       "      <td>b'China has moved 10 *million* quake survivors...</td>\n",
       "      <td>b\"Bush announces Operation Get All Up In Russi...</td>\n",
       "      <td>b'Russian forces sink Georgian ships '</td>\n",
       "      <td>...</td>\n",
       "      <td>b'Elephants extinct by 2020?'</td>\n",
       "      <td>b'US humanitarian missions soon in Georgia - i...</td>\n",
       "      <td>b\"Georgia's DDOS came from US sources\"</td>\n",
       "      <td>b'Russian convoy heads into Georgia, violating...</td>\n",
       "      <td>b'Israeli defence minister: US against strike ...</td>\n",
       "      <td>b'Gorbachev: We Had No Choice'</td>\n",
       "      <td>b'Witness: Russian forces head towards Tbilisi...</td>\n",
       "      <td>b' Quarter of Russians blame U.S. for conflict...</td>\n",
       "      <td>b'Georgian president  says US military will ta...</td>\n",
       "      <td>b'2006: Nobel laureate Aleksander Solzhenitsyn...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2008-08-14</td>\n",
       "      <td>1</td>\n",
       "      <td>b'All the experts admit that we should legalis...</td>\n",
       "      <td>b'War in South Osetia - 89 pictures made by a ...</td>\n",
       "      <td>b'Swedish wrestler Ara Abrahamian throws away ...</td>\n",
       "      <td>b'Russia exaggerated the death toll in South O...</td>\n",
       "      <td>b'Missile That Killed 9 Inside Pakistan May Ha...</td>\n",
       "      <td>b\"Rushdie Condemns Random House's Refusal to P...</td>\n",
       "      <td>b'Poland and US agree to missle defense deal. ...</td>\n",
       "      <td>b'Will the Russians conquer Tblisi? Bet on it,...</td>\n",
       "      <td>...</td>\n",
       "      <td>b'Bank analyst forecast Georgian crisis 2 days...</td>\n",
       "      <td>b\"Georgia confict could set back Russia's US r...</td>\n",
       "      <td>b'War in the Caucasus is as much the product o...</td>\n",
       "      <td>b'\"Non-media\" photos of South Ossetia/Georgia ...</td>\n",
       "      <td>b'Georgian TV reporter shot by Russian sniper ...</td>\n",
       "      <td>b'Saudi Arabia: Mother moves to block child ma...</td>\n",
       "      <td>b'Taliban wages war on humanitarian aid workers'</td>\n",
       "      <td>b'Russia: World  \"can forget about\" Georgia\\'s...</td>\n",
       "      <td>b'Darfur rebels accuse Sudan of mounting major...</td>\n",
       "      <td>b'Philippines : Peace Advocate say Muslims nee...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         Date  Label                                               Top1  \\\n",
       "0  2008-08-08      0  b\"Georgia 'downs two Russian warplanes' as cou...   \n",
       "1  2008-08-11      1  b'Why wont America and Nato help us? If they w...   \n",
       "2  2008-08-12      0  b'Remember that adorable 9-year-old who sang a...   \n",
       "3  2008-08-13      0  b' U.S. refuses Israel weapons to attack Iran:...   \n",
       "4  2008-08-14      1  b'All the experts admit that we should legalis...   \n",
       "\n",
       "                                                Top2  \\\n",
       "0            b'BREAKING: Musharraf to be impeached.'   \n",
       "1        b'Bush puts foot down on Georgian conflict'   \n",
       "2                 b\"Russia 'ends Georgia operation'\"   \n",
       "3  b\"When the president ordered to attack Tskhinv...   \n",
       "4  b'War in South Osetia - 89 pictures made by a ...   \n",
       "\n",
       "                                                Top3  \\\n",
       "0  b'Russia Today: Columns of troops roll into So...   \n",
       "1  b\"Jewish Georgian minister: Thanks to Israeli ...   \n",
       "2  b'\"If we had no sexual harassment we would hav...   \n",
       "3  b' Israel clears troops who killed Reuters cam...   \n",
       "4  b'Swedish wrestler Ara Abrahamian throws away ...   \n",
       "\n",
       "                                                Top4  \\\n",
       "0  b'Russian tanks are moving towards the capital...   \n",
       "1  b'Georgian army flees in disarray as Russians ...   \n",
       "2  b\"Al-Qa'eda is losing support in Iraq because ...   \n",
       "3  b'Britain\\'s policy of being tough on drugs is...   \n",
       "4  b'Russia exaggerated the death toll in South O...   \n",
       "\n",
       "                                                Top5  \\\n",
       "0  b\"Afghan children raped with 'impunity,' U.N. ...   \n",
       "1      b\"Olympic opening ceremony fireworks 'faked'\"   \n",
       "2  b'Ceasefire in Georgia: Putin Outmaneuvers the...   \n",
       "3  b'Body of 14 year old found in trunk; Latest (...   \n",
       "4  b'Missile That Killed 9 Inside Pakistan May Ha...   \n",
       "\n",
       "                                                Top6  \\\n",
       "0  b'150 Russian tanks have entered South Ossetia...   \n",
       "1  b'What were the Mossad with fraudulent New Zea...   \n",
       "2  b'Why Microsoft and Intel tried to kill the XO...   \n",
       "3  b'China has moved 10 *million* quake survivors...   \n",
       "4  b\"Rushdie Condemns Random House's Refusal to P...   \n",
       "\n",
       "                                                Top7  \\\n",
       "0  b\"Breaking: Georgia invades South Ossetia, Rus...   \n",
       "1  b'Russia angered by Israeli military sale to G...   \n",
       "2  b'Stratfor: The Russo-Georgian War and the Bal...   \n",
       "3  b\"Bush announces Operation Get All Up In Russi...   \n",
       "4  b'Poland and US agree to missle defense deal. ...   \n",
       "\n",
       "                                                Top8  \\\n",
       "0  b\"The 'enemy combatent' trials are nothing but...   \n",
       "1  b'An American citizen living in S.Ossetia blam...   \n",
       "2  b\"I'm Trying to Get a Sense of This Whole Geor...   \n",
       "3             b'Russian forces sink Georgian ships '   \n",
       "4  b'Will the Russians conquer Tblisi? Bet on it,...   \n",
       "\n",
       "                         ...                          \\\n",
       "0                        ...                           \n",
       "1                        ...                           \n",
       "2                        ...                           \n",
       "3                        ...                           \n",
       "4                        ...                           \n",
       "\n",
       "                                               Top16  \\\n",
       "0  b'Georgia Invades South Ossetia - if Russia ge...   \n",
       "1  b'Israel and the US behind the Georgian aggres...   \n",
       "2  b'U.S. troops still in Georgia (did you know t...   \n",
       "3                      b'Elephants extinct by 2020?'   \n",
       "4  b'Bank analyst forecast Georgian crisis 2 days...   \n",
       "\n",
       "                                               Top17  \\\n",
       "0                b'Al-Qaeda Faces Islamist Backlash'   \n",
       "1  b'\"Do not believe TV, neither Russian nor Geor...   \n",
       "2       b'Why Russias response to Georgia was right'   \n",
       "3  b'US humanitarian missions soon in Georgia - i...   \n",
       "4  b\"Georgia confict could set back Russia's US r...   \n",
       "\n",
       "                                               Top18  \\\n",
       "0  b'Condoleezza Rice: \"The US would not act to p...   \n",
       "1  b'Riots are still going on in Montreal (Canada...   \n",
       "2  b'Gorbachev accuses U.S. of making a \"serious ...   \n",
       "3             b\"Georgia's DDOS came from US sources\"   \n",
       "4  b'War in the Caucasus is as much the product o...   \n",
       "\n",
       "                                               Top19  \\\n",
       "0  b'This is a busy day:  The European Union has ...   \n",
       "1    b'China to overtake US as largest manufacturer'   \n",
       "2         b'Russia, Georgia, and NATO: Cold War Two'   \n",
       "3  b'Russian convoy heads into Georgia, violating...   \n",
       "4  b'\"Non-media\" photos of South Ossetia/Georgia ...   \n",
       "\n",
       "                                               Top20  \\\n",
       "0  b\"Georgia will withdraw 1,000 soldiers from Ir...   \n",
       "1                     b'War in South Ossetia [PICS]'   \n",
       "2  b'Remember that adorable 62-year-old who led y...   \n",
       "3  b'Israeli defence minister: US against strike ...   \n",
       "4  b'Georgian TV reporter shot by Russian sniper ...   \n",
       "\n",
       "                                               Top21  \\\n",
       "0  b'Why the Pentagon Thinks Attacking Iran is a ...   \n",
       "1  b'Israeli Physicians Group Condemns State Tort...   \n",
       "2          b'War in Georgia: The Israeli connection'   \n",
       "3                     b'Gorbachev: We Had No Choice'   \n",
       "4  b'Saudi Arabia: Mother moves to block child ma...   \n",
       "\n",
       "                                               Top22  \\\n",
       "0  b'Caucasus in crisis: Georgia invades South Os...   \n",
       "1  b' Russia has just beaten the United States ov...   \n",
       "2  b'All signs point to the US encouraging Georgi...   \n",
       "3  b'Witness: Russian forces head towards Tbilisi...   \n",
       "4   b'Taliban wages war on humanitarian aid workers'   \n",
       "\n",
       "                                               Top23  \\\n",
       "0  b'Indian shoe manufactory  - And again in a se...   \n",
       "1  b'Perhaps *the* question about the Georgia - R...   \n",
       "2  b'Christopher King argues that the US and NATO...   \n",
       "3  b' Quarter of Russians blame U.S. for conflict...   \n",
       "4  b'Russia: World  \"can forget about\" Georgia\\'s...   \n",
       "\n",
       "                                               Top24  \\\n",
       "0  b'Visitors Suffering from Mental Illnesses Ban...   \n",
       "1                 b'Russia is so much better at war'   \n",
       "2                        b'America: The New Mexico?'   \n",
       "3  b'Georgian president  says US military will ta...   \n",
       "4  b'Darfur rebels accuse Sudan of mounting major...   \n",
       "\n",
       "                                               Top25  \n",
       "0           b\"No Help for Mexico's Kidnapping Surge\"  \n",
       "1  b\"So this is what it's come to: trading sex fo...  \n",
       "2  b\"BBC NEWS | Asia-Pacific | Extinction 'by man...  \n",
       "3  b'2006: Nobel laureate Aleksander Solzhenitsyn...  \n",
       "4  b'Philippines : Peace Advocate say Muslims nee...  \n",
       "\n",
       "[5 rows x 27 columns]"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其实看起来特别的简单直观。如果是1，那么当日的DJIA就提高或者不变了。如果是1，那么DJIA那天就是跌了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来我们把headlines先合并起来。因为我们显然是需要考虑所有的news的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data[\"combined_news\"] = data.filter(regex=(\"Top.*\")).apply(lambda x: ''.join(str(x.values)), axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 分割测试/训练集\n",
    "\n",
    "这下，我们可以把数据给分成Training/Testing data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train = data[data['Date'] < '2015-01-01']\n",
    "test = data[data['Date'] > '2014-12-31']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 提取features\n",
    "\n",
    "也就是文本中的特征\n",
    "这里一定要注意，fit你的model的时候，要用training set，不能一股脑的把所有的数据都放进来。（当然，现实中你可以这么做）因为我们要假设testing set我们在训练的时候是完全接触不到的，是不可知的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "feature_extraction = TfidfVectorizer()\n",
    "X_train = feature_extraction.fit_transform(train[\"combined_news\"].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "所以*X_train* 搞完以后，直接给 *X_test* 做个Transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_test = feature_extraction.transform(test[\"combined_news\"].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同理，y就是我们已经准备好的label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_train = train[\"Label\"].values\n",
    "y_test = test[\"Label\"].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 训练模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "clf = SVC(probability=True, kernel='rbf')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "把你的*X_train*和*y_train*给fit进去"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=True, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False)"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 预测"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "predictions = clf.predict_proba(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 验证准确度\n",
    "\n",
    "按照我之前给的要求，用AUC作为binary classification的Metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC-AUC yields 0.574260752688\n"
     ]
    }
   ],
   "source": [
    "print('ROC-AUC yields ' + str(roc_auc_score(y_test, predictions[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 进阶版\n",
    "\n",
    "### 文本预处理\n",
    "\n",
    "我们这样直接把文本放进TF-IDF，虽然简单方便，但是还是不够严谨的。\n",
    "我们可以把原文本做进一步的处理。\n",
    "\n",
    "+ 小写 / 分成小tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['[', 'most', 'cases', 'of', 'cancer', 'are', 'the', 'result', 'of', 'sheer', 'bad', 'luck', 'rather', 'than', 'unhealthy', 'lifestyles,', 'diet', 'or', 'even', 'inherited', 'genes,', 'new', 'research', 'suggests.', 'random', 'mutations', 'that', 'occur', 'in', 'dna', 'when', 'cells', 'divide', 'are', 'responsible', 'for', 'two', 'thirds', 'of', 'adult', 'cancers', 'across', 'a', 'wide', 'range', 'of', 'tissues.', 'iran', 'dismissed', 'united', 'states', 'efforts', 'to', 'fight', 'islamic', 'state', 'as', 'a', 'ploy', 'to', 'advance', 'u.s.', 'policies', 'in', 'the', 'region:', 'the', 'reality', 'is', 'that', 'the', 'united', 'states', 'is', 'not', 'acting', 'to', 'eliminate', 'daesh.', 'they', 'are', 'not', 'even', 'interested', 'in', 'weakening', 'daesh,', 'they', 'are', 'only', 'interested', 'in', 'managing', 'it', 'poll:', 'one', 'in', '8', 'germans', 'would', 'join', 'anti-muslim', 'marches', 'uk', 'royal', 'familys', 'prince', 'andrew', 'named', 'in', 'us', 'lawsuit', 'over', 'underage', 'sex', 'allegations', 'some', '40', 'asylum-seekers', 'refused', 'to', 'leave', 'the', 'bus', 'when', 'they', 'arrived', 'at', 'their', 'destination', 'in', 'rural', 'northern', 'sweden,', 'demanding', 'that', 'they', 'be', 'taken', 'back', 'to', 'malm', 'or', 'some', 'big', 'city.', 'pakistani', 'boat', 'blows', 'self', 'up', 'after', 'india', 'navy', 'chase.', 'all', 'four', 'people', 'on', 'board', 'the', 'vessel', 'from', 'near', 'the', 'pakistani', 'port', 'city', 'of', 'karachi', 'are', 'believed', 'to', 'have', 'been', 'killed', 'in', 'the', 'dramatic', 'episode', 'in', 'the', 'arabian', 'sea', 'on', 'new', 'years', 'eve,', 'according', 'to', 'indias', 'defence', 'ministry.', 'sweden', 'hit', 'by', 'third', 'mosque', 'arson', 'attack', 'in', 'a', 'week', '940', 'cars', 'set', 'alight', 'during', 'french', 'new', 'year', 'salaries', 'for', 'top', 'ceos', 'rose', 'twice', 'as', 'fast', 'as', 'average', 'canadian', 'since', 'recession:', 'study', 'norway', 'violated', 'equal-pay', 'law,', 'judge', 'says:', 'judge', 'finds', 'consulate', 'employee', 'was', 'unjustly', 'paid', '$30,000', 'less', 'than', 'her', 'male', 'counterpart', 'imam', 'wants', 'radical', 'recruiters', 'of', 'muslim', 'youth', 'in', 'canada', 'identified', 'and', 'dealt', 'with', 'saudi', 'arabia', 'beheaded', '83', 'people', 'in', '2014,', 'the', 'most', 'in', 'years', 'a', 'living', 'hell', 'for', 'slaves', 'on', 'remote', 'south', 'korean', 'islands', '-', 'slavery', 'thrives', 'on', 'this', 'chain', 'of', 'rural', 'islands', 'off', 'south', 'koreas', 'rugged', 'southwest', 'coast,', 'nurtured', 'by', 'a', 'long', 'history', 'of', 'exploitation', 'and', 'the', 'demands', 'of', 'trying', 'to', 'squeeze', 'a', 'living', 'from', 'the', 'sea.', 'worlds', '400', 'richest', 'get', 'richer,', 'adding', '$92bn', 'in', '2014', 'rental', 'car', 'stereos', 'infringe', 'copyright,', 'music', 'rights', 'group', 'says', 'ukrainian', 'minister', 'threatens', 'tv', 'channel', 'with', 'closure', 'for', 'airing', 'russian', 'entertainers', 'palestinian', 'president', 'mahmoud', 'abbas', 'has', 'entered', 'into', 'his', 'most', 'serious', 'confrontation', 'yet', 'with', 'israel', 'by', 'signing', 'onto', 'the', 'international', 'criminal', 'court.', 'his', 'decision', 'on', 'wednesday', 'gives', 'the', 'court', 'jurisdiction', 'over', 'crimes', 'committed', 'in', 'palestinian', 'lands.', 'israeli', 'security', 'center', 'publishes', 'names', 'of', '50', 'killed', 'terrorists', 'concealed', 'by', 'hamas', 'the', 'year', '2014', 'was', 'the', 'deadliest', 'year', 'yet', 'in', 'syrias', 'four-year', 'conflict,', 'with', 'over', '76,000', 'killed', 'a', 'secret', 'underground', 'complex', 'built', 'by', 'the', 'nazis', 'that', 'may', 'have', 'been', 'used', 'for', 'the', 'development', 'of', 'wmds,', 'including', 'a', 'nuclear', 'bomb,', 'has', 'been', 'uncovered', 'in', 'austria.', 'restrictions', 'on', 'web', 'freedom', 'a', 'major', 'global', 'issue', 'in', '2015', 'austrian', 'journalist', 'erich', 'mchel', 'delivered', 'a', 'presentation', 'in', 'hamburg', 'at', 'the', 'annual', 'meeting', 'of', 'the', 'chaos', 'computer', 'club', 'on', 'monday', 'december', '29,', 'detailing', 'the', 'various', 'locations', 'where', 'the', 'us', 'nsa', 'has', 'been', 'actively', 'collecting', 'and', 'processing', 'electronic', 'intelligence', 'in', 'vienna.', 'thousands', 'of', 'ukraine', 'nationalists', 'march', 'in', 'kiev', 'chinas', 'new', 'years', 'resolution:', 'no', 'more', 'harvesting', 'executed', 'prisoners', 'organs', 'authorities', 'pull', 'plug', 'on', 'russias', 'last', 'politically', 'independent', 'tv', 'station]']\n"
     ]
    }
   ],
   "source": [
    "X_train = train[\"combined_news\"].str.lower().str.replace('\"', '').str.replace(\"'\", '').str.split()\n",
    "X_test = test[\"combined_news\"].str.lower().str.replace('\"', '').str.replace(\"'\", '').str.split()\n",
    "print(X_test[1611])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ 删减停止词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from nltk.corpus import stopwords\n",
    "stop = stopwords.words('english')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ 删除数字"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re\n",
    "def hasNumbers(inputString):\n",
    "    return bool(re.search(r'\\d', inputString))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "+ lemma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from nltk.stem import WordNetLemmatizer\n",
    "wordnet_lemmatizer = WordNetLemmatizer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们把这些元素全都合成一个func"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def check(word):\n",
    "    \"\"\"\n",
    "    如果需要这个单词，则True\n",
    "    如果应该去除，则False\n",
    "    \"\"\"\n",
    "    if word in stop:\n",
    "        return False\n",
    "    elif hasNumbers(word):\n",
    "        return False\n",
    "    else:\n",
    "        return True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后我们把整个流程放进我们的DF中处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['[', 'case', 'cancer', 'result', 'sheer', 'bad', 'luck', 'rather', 'unhealthy', 'lifestyles,', 'diet', 'even', 'inherited', 'genes,', 'new', 'research', 'suggests.', 'random', 'mutation', 'occur', 'dna', 'cell', 'divide', 'responsible', 'two', 'third', 'adult', 'cancer', 'across', 'wide', 'range', 'tissues.', 'iran', 'dismissed', 'united', 'state', 'effort', 'fight', 'islamic', 'state', 'ploy', 'advance', 'u.s.', 'policy', 'region:', 'reality', 'united', 'state', 'acting', 'eliminate', 'daesh.', 'even', 'interested', 'weakening', 'daesh,', 'interested', 'managing', 'poll:', 'one', 'german', 'would', 'join', 'anti-muslim', 'march', 'uk', 'royal', 'family', 'prince', 'andrew', 'named', 'u', 'lawsuit', 'underage', 'sex', 'allegation', 'asylum-seekers', 'refused', 'leave', 'bus', 'arrived', 'destination', 'rural', 'northern', 'sweden,', 'demanding', 'taken', 'back', 'malm', 'big', 'city.', 'pakistani', 'boat', 'blow', 'self', 'india', 'navy', 'chase.', 'four', 'people', 'board', 'vessel', 'near', 'pakistani', 'port', 'city', 'karachi', 'believed', 'killed', 'dramatic', 'episode', 'arabian', 'sea', 'new', 'year', 'eve,', 'according', 'india', 'defence', 'ministry.', 'sweden', 'hit', 'third', 'mosque', 'arson', 'attack', 'week', 'car', 'set', 'alight', 'french', 'new', 'year', 'salary', 'top', 'ceo', 'rose', 'twice', 'fast', 'average', 'canadian', 'since', 'recession:', 'study', 'norway', 'violated', 'equal-pay', 'law,', 'judge', 'says:', 'judge', 'find', 'consulate', 'employee', 'unjustly', 'paid', 'le', 'male', 'counterpart', 'imam', 'want', 'radical', 'recruiter', 'muslim', 'youth', 'canada', 'identified', 'dealt', 'saudi', 'arabia', 'beheaded', 'people', 'year', 'living', 'hell', 'slave', 'remote', 'south', 'korean', 'island', '-', 'slavery', 'thrives', 'chain', 'rural', 'island', 'south', 'korea', 'rugged', 'southwest', 'coast,', 'nurtured', 'long', 'history', 'exploitation', 'demand', 'trying', 'squeeze', 'living', 'sea.', 'world', 'richest', 'get', 'richer,', 'adding', 'rental', 'car', 'stereo', 'infringe', 'copyright,', 'music', 'right', 'group', 'say', 'ukrainian', 'minister', 'threatens', 'tv', 'channel', 'closure', 'airing', 'russian', 'entertainer', 'palestinian', 'president', 'mahmoud', 'abbas', 'entered', 'serious', 'confrontation', 'yet', 'israel', 'signing', 'onto', 'international', 'criminal', 'court.', 'decision', 'wednesday', 'give', 'court', 'jurisdiction', 'crime', 'committed', 'palestinian', 'lands.', 'israeli', 'security', 'center', 'publishes', 'name', 'killed', 'terrorist', 'concealed', 'hamas', 'year', 'deadliest', 'year', 'yet', 'syria', 'four-year', 'conflict,', 'killed', 'secret', 'underground', 'complex', 'built', 'nazi', 'may', 'used', 'development', 'wmds,', 'including', 'nuclear', 'bomb,', 'uncovered', 'austria.', 'restriction', 'web', 'freedom', 'major', 'global', 'issue', 'austrian', 'journalist', 'erich', 'mchel', 'delivered', 'presentation', 'hamburg', 'annual', 'meeting', 'chaos', 'computer', 'club', 'monday', 'december', 'detailing', 'various', 'location', 'u', 'nsa', 'actively', 'collecting', 'processing', 'electronic', 'intelligence', 'vienna.', 'thousand', 'ukraine', 'nationalist', 'march', 'kiev', 'china', 'new', 'year', 'resolution:', 'harvesting', 'executed', 'prisoner', 'organ', 'authority', 'pull', 'plug', 'russia', 'last', 'politically', 'independent', 'tv', 'station]']\n"
     ]
    }
   ],
   "source": [
    "X_train = X_train.apply(lambda x: [wordnet_lemmatizer.lemmatize(item) for item in x if check(item)])\n",
    "X_test = X_test.apply(lambda x: [wordnet_lemmatizer.lemmatize(item) for item in x if check(item)])\n",
    "print(X_test[1611])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为外部库，比如sklearn 只支持string输入，所以我们把调整后的list再变回string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ case cancer result sheer bad luck rather unhealthy lifestyles, diet even inherited genes, new research suggests. random mutation occur dna cell divide responsible two third adult cancer across wide range tissues. iran dismissed united state effort fight islamic state ploy advance u.s. policy region: reality united state acting eliminate daesh. even interested weakening daesh, interested managing poll: one german would join anti-muslim march uk royal family prince andrew named u lawsuit underage sex allegation asylum-seekers refused leave bus arrived destination rural northern sweden, demanding taken back malm big city. pakistani boat blow self india navy chase. four people board vessel near pakistani port city karachi believed killed dramatic episode arabian sea new year eve, according india defence ministry. sweden hit third mosque arson attack week car set alight french new year salary top ceo rose twice fast average canadian since recession: study norway violated equal-pay law, judge says: judge find consulate employee unjustly paid le male counterpart imam want radical recruiter muslim youth canada identified dealt saudi arabia beheaded people year living hell slave remote south korean island - slavery thrives chain rural island south korea rugged southwest coast, nurtured long history exploitation demand trying squeeze living sea. world richest get richer, adding rental car stereo infringe copyright, music right group say ukrainian minister threatens tv channel closure airing russian entertainer palestinian president mahmoud abbas entered serious confrontation yet israel signing onto international criminal court. decision wednesday give court jurisdiction crime committed palestinian lands. israeli security center publishes name killed terrorist concealed hamas year deadliest year yet syria four-year conflict, killed secret underground complex built nazi may used development wmds, including nuclear bomb, uncovered austria. restriction web freedom major global issue austrian journalist erich mchel delivered presentation hamburg annual meeting chaos computer club monday december detailing various location u nsa actively collecting processing electronic intelligence vienna. thousand ukraine nationalist march kiev china new year resolution: harvesting executed prisoner organ authority pull plug russia last politically independent tv station]\n"
     ]
    }
   ],
   "source": [
    "X_train = X_train.apply(lambda x: ' '.join(x))\n",
    "X_test = X_test.apply(lambda x: ' '.join(x))\n",
    "print(X_test[1611])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "重新Fit一遍我们的clf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "feature_extraction = TfidfVectorizer(lowercase=False)\n",
    "X_train = feature_extraction.fit_transform(X_train.values)\n",
    "X_test = feature_extraction.transform(X_test.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "再跑一遍"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ROC-AUC yields 0.539566532258\n"
     ]
    }
   ],
   "source": [
    "clf = SVC(probability=True, kernel='rbf')\n",
    "clf.fit(X_train, y_train)\n",
    "predictions = clf.predict_proba(X_test)\n",
    "print('ROC-AUC yields ' + str(roc_auc_score(y_test, predictions[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "不小心发现，折腾一圈以后，这个结果还不如之前的简单版。\n",
    "\n",
    "这个故事告诉我们，**逼装得太多，不一定有卵用。**\n",
    "\n",
    "造成如此的原因有几种：\n",
    "\n",
    "+ 数据点太少\n",
    "\n",
    "在大量的数据下，标准的文本预处理流程还是需要的，以提高机器学习的准确度。\n",
    "\n",
    "+ One-Off result\n",
    "\n",
    "我们到现在都只是跑了一次而已。如果我们像前面的例子一样，用Cross Validation来玩这组数据，说不定我们会发现，分数高的clf其实是overfitted了的。\n",
    "\n",
    "所以，我们在做kaggle竞赛的时候，最好是要给自己的clf做好CV验证。不要贻笑大方。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
